{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-af7de1c2b8e61d18",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "# Exercise 9) Function Approximators in Prediction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-fbd6485cf4881324",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "Up to now, we have used tabular methods to store data. These methods are rather simple concerning implementation and use, but they lack efficiency in several fields:\n",
    "\n",
    "- Tabular methods need a lot of storage capacity. Saving one number per possible state (or state-action combination) is very expensive for systems with a large discrete problem space. It even gets impossible when looking at continuous problem spaces.\n",
    "- Tabular methods are unable to generalize. Every update to the table only contains information about one specific state, which means that we only learn about states we have seen and not about the states that are \"near\" them. We can decide if we want to solve this issue by extending the training time or by lowering our expectations of the outcome.\n",
    "\n",
    "or alternatively, we make use of function approximators!\n",
    "\n",
    "For our investigations, we will have a look at the MountainCar environment from OpenAI's `gym`.\n",
    "\n",
    "This system has a continuous two-dimensional state space and a discrete one-dimensional action space.\n",
    "\n",
    "The MountainCar can be compared to the pendulum concerning that a successful policy must be able to perform a swing-up movement. The car has limited ability to accelerate uphill, it has to build up velocity by accelerating downhill. \n",
    "In contrast to the pendulum scenario, the MountainCar terminates upon reaching the goal on the mountaintop to the right. Every timestep will be rewarded with a reward of $r_{k+1}=-1$, such that it is most beneficial to end an episode as fast as possible. \n",
    "\n",
    "For this exercise we want to concentrate on the evaluation of an existing policy.\n",
    "\n",
    "![](https://marcinbogdanski.github.io/rl-sketchpad/Deep_Q_Network/assets/mountaincar.gif)\n",
    "\n",
    "(Source of GIF: https://marcinbogdanski.github.io/rl-sketchpad/Deep_Q_Network/1010_DQN_ClassicControl.html)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-33ec88c4b59b3a7e",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "Please make sure to have `PyTorch` installed:\n",
    "\n",
    "`pip install torch`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-70e813bdcedc921c",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import gymnasium as gym\n",
    "from tqdm.notebook import tqdm\n",
    "import matplotlib.pyplot as plt\n",
    "plt.style.use('seaborn-v0_8-talk')\n",
    "from mpl_toolkits.mplot3d import Axes3D\n",
    "\n",
    "import torch\n",
    "import torch.nn as nn\n",
    "import torch.optim as optim"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-c5ca2799454a3b44",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "Test the environment:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-a5a477ab9ba8a554",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "env = gym.make('MountainCar-v0', render_mode=\"human\")\n",
    "state = env.reset()\n",
    "\n",
    "while True:\n",
    "    #env.render() #uncomment for visualization\n",
    "    state, reward, terminated, truncated, _ = env.step(env.action_space.sample())\n",
    "    done = terminated or truncated\n",
    "    if done:\n",
    "        break\n",
    "        \n",
    "env.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-0f213185bfcd49f4",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "## 1) Linear Function Approximation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-5683963ffdb84e94",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "The next cell contains a simple policy for the swing-up of the MountainCar. We want to predict the value function with the use of a linear function approximator of the form:\n",
    "\n",
    "$\\hat{v}(\\mathbf{x_k})=\\mathbf{w}^\\text{T} \\tilde{\\mathbf{x}}_k$.\n",
    "\n",
    "Herein, the weight vector is denoted by $\\mathbf{w}$. The feature vector $\\tilde{\\mathbf{x}}_k$ is derived from the state vector $\\mathbf{x}_k$:\n",
    "\n",
    "$\\tilde{\\mathbf{x}}_k = f (\\mathbf{x}_k)$\n",
    "\n",
    "The state vector $\\mathbf{x}_k$ consists of the (vertical) position and the velocity.\n",
    "\n",
    "Write a Semi-Gradient TD(0) prediction algorithm that learns the weights of this linear value function approximator.\n",
    "Make use of a `feature` function, that accepts the state vector as an input and returns a feature vector that is derived from the state. The feature vector should be equal to zero ($\\tilde{\\mathbf{x}}_T = \\mathbf{0}$) if the finish line has been passed (this happens if the position is greater than $0.5$). Can you find a feature definition that enables a good value estimation?\n",
    "\n",
    "Hint:\n",
    "As it seems, the chances of successfully reaching the finish line rise when the car's energy increases. You may want to look at the [MountainCar sourcecode](https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/envs/classic_control/mountain_car.py) to acquire expert knowledge for your feature definition."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-477b3082c50e0c32",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "def policy(state):\n",
    "    \"\"\"Decides on one of the actions in dependence of the current state.\n",
    "\n",
    "    Possible actions:\n",
    "        0: accelerate left\n",
    "        1: idle (do not accelerate)\n",
    "        2: accelerate right\n",
    "    \"\"\"    \n",
    "    pos, vel = state  \n",
    "    \n",
    "    # fixed policy, do not change\n",
    "    action = 2*int(vel > 0)\n",
    "\n",
    "    return action\n",
    "\n",
    "def feature(state):\n",
    "    \"\"\"Returns an extended feature vector derived from the state.\"\"\"\n",
    "    pos, vel = state\n",
    "    \n",
    "    # YOUR CODE HERE\n",
    "    raise NotImplementedError()    \n",
    "    return feature_vec"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def interact(state):\n",
    "    \"\"\"Interact with the environment to get to the next state.\n",
    "\n",
    "    Args:\n",
    "        state: The current state of the environment\n",
    "\n",
    "    Returns:\n",
    "        next_state: The state of the environment after interaction\n",
    "        reward: The reward for the current interaction\n",
    "        done: Whether the episode is finished    \n",
    "    \"\"\"\n",
    "    # YOUR CODE HERE\n",
    "    raise NotImplementedError()\n",
    "    return next_state, reward, done\n",
    "\n",
    "def forward(w, feat_state):\n",
    "    \"\"\"Use the model to predict the state-value for the given extended state.\n",
    "\n",
    "    Args:\n",
    "        w: The parameters of your state-value estimator.\n",
    "        state: The extended state for which to estimate the state-value\n",
    "\n",
    "    Returns:\n",
    "        value: State-value estimation\n",
    "    \"\"\"\n",
    "    # YOUR CODE HERE\n",
    "    raise NotImplementedError()\n",
    "    return value\n",
    "\n",
    "def learn(w, state, next_state, reward, alpha, gamma):\n",
    "    \"\"\"Learns from the interaction via bootstrapping.\n",
    "\n",
    "    Args:\n",
    "        w: The parameters of your state-value estimator\n",
    "        state: The state of the environment before interaction\n",
    "        next_state: The state of the environment after interaction\n",
    "        reward: The reward for the interaction\n",
    "        alpha: Learning rate\n",
    "        gamma: Discount factor\n",
    "\n",
    "    Returns:\n",
    "        w: The updated parameter vector\n",
    "    \"\"\"\n",
    "    # YOUR CODE HERE\n",
    "    raise NotImplementedError()\n",
    "    return w"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-feb551d792c23d8c",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "### Semi-Gradient TD(0)\n",
    "alpha = 0.1\n",
    "gamma = 0.9\n",
    "nb_episodes = 500\n",
    "\n",
    "env = gym.make('MountainCar-v0')\n",
    "\n",
    "state, _ = env.reset()\n",
    "feat_state = feature(state)\n",
    "w = np.zeros([len(feat_state)])\n",
    "visited_states = []\n",
    "\n",
    "for j in tqdm(range(nb_episodes)):\n",
    "    \n",
    "    # YOUR CODE HERE\n",
    "    raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-2c70f952f9004c1e",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "Evaluate the result parameter vector by investigating the performance of the value function approximator on the whole state space, preferrably in a plot. Which parts of the estimation seem accurate, which do not?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-f73c98b6fdf77664",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "### This script will plot the value function as a colored map\n",
    "resolution = 100\n",
    "\n",
    "pos_vec = np.linspace(-1.2, 0.6, resolution)\n",
    "vel_vec = np.linspace(-0.07, 0.07, resolution)\n",
    "\n",
    "pos_mat, vel_mat = np.meshgrid(pos_vec, vel_vec)\n",
    "value_mat = np.zeros([resolution, resolution])\n",
    "\n",
    "for pos_idx, pos in enumerate(tqdm(pos_vec)):\n",
    "    for vel_idx, vel in enumerate(vel_vec):\n",
    "        feat_state = feature(np.array([pos, vel]))\n",
    "        value_mat[vel_idx, pos_idx] = forward(w, feat_state)\n",
    "\n",
    "# Plot\n",
    "plt.pcolormesh(pos_mat, vel_mat, value_mat)\n",
    "plt.xlabel(\"position\")\n",
    "plt.ylabel(\"velocity\")\n",
    "cbar = plt.colorbar()\n",
    "cbar.set_label(\"V\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-4c0041ae3e474368",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "### This script will plot the value function as a surface\n",
    "fig = plt.figure()\n",
    "ax = plt.axes(projection='3d')\n",
    "ax.plot_surface(pos_mat, vel_mat, np.squeeze(value_mat), cmap=\"viridis\")\n",
    "ax.set_xlabel('\\n\\nposition')\n",
    "ax.set_ylabel('\\n\\nvelocity')\n",
    "ax.set_zlabel('V')\n",
    "\n",
    "ax.view_init(30, -135)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-3d2cc84be4954f81",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "plt.pcolormesh(pos_mat, vel_mat, value_mat)\n",
    "visited_states = np.array(visited_states)\n",
    "plt.xlabel(\"position\")\n",
    "plt.ylabel(\"velocity\")\n",
    "plt.colorbar()\n",
    "plt.scatter(visited_states[:, 0], visited_states[:, 1], color=\"red\", s=0.5)\n",
    "cbar.set_label(\"V\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-ea260243c55d68a7",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "## 2) Recursive Least Squares TD"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-bf761f1c6b1ed339",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "In the previous task we computed the parameters of the value function iteratively, by using old parameters and new observations to calculate new parameters. This works, but we can also be more data efficient. Recursive Least Squares Temporal Difference Learning (RLS-TD) allows to determine new parameters on the basis of new AND old observations, such that the parameters we receive are optimally fitted while taking past experiences into account. This method does not use a step size but only a forgetting factor $\\lambda \\in [0,1]$ which defines the impact of past experiences. \n",
    "\n",
    "Write an RLS-TD algorithm to solve the prediction problem. Check the stability of your code for the forgetting factors $\\lambda \\in \\{0.9, 0.99, 1\\}$. As this algorithm contains a lot of matrix multiplication, pay attention to vectors being represented within the correct dimensions. One could e.g. use the `row_vector.reshape(-1, 1)` command to turn a row vector into a column vector, or use the postix `.T` to transpose an array (note, this has no effect on one-dimensional arrays).\n",
    "\n",
    "The feature definition from task (1) can be reused here as long as $\\tilde{\\mathbf{x}}_T = \\mathbf{0}$ holds.\n",
    "\n",
    "**Hint:** `interact` and `forward` from 1) can be reused in this task as well. For the use of `forward`, you only need to ensure that `w` is shaped properly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-8850c630ca8dc45a",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "def feature(state):\n",
    "    pos, vel = state\n",
    "    win = int(pos > 0.5)\n",
    "    # YOUR CODE HERE\n",
    "    raise NotImplementedError()    \n",
    "    return feature_vec"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def learn(w, P, state, next_state, reward, _lambda, gamma, ident_mat):\n",
    "    \"\"\"Learns from the interaction via RLS-TD.\n",
    "\n",
    "    Args:\n",
    "        w: The parameters of your state-value estimator\n",
    "        P: Covariance matrix of the last step\n",
    "        state: The state of the environment before interaction\n",
    "        next_state: The state of the environment after interaction\n",
    "        reward: The reward for the interaction\n",
    "        _lambda: Forgetting factor\n",
    "        gamma: Discount factor\n",
    "        ident_mat: The identity matrix for you feature dimension\n",
    "\n",
    "    Returns:\n",
    "        w: The updated parameter vector\n",
    "        P: The updated covariance matrix\n",
    "    \"\"\"\n",
    "    # YOUR CODE HERE\n",
    "    raise NotImplementedError()\n",
    "    return w, P"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-d4d1b5df6fd0085c",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "gamma = 0.9\n",
    "_lambda = 1 # we call it like that because \"lambda\" is a built-in command/ reserved syntax in python\n",
    "nb_episodes = 500\n",
    "\n",
    "env = gym.make('MountainCar-v0')\n",
    "state, _ = env.reset()\n",
    "feat_state = feature(state)\n",
    "feat_dims = len(feat_state)\n",
    "\n",
    "P = np.eye(feat_dims)\n",
    "w = np.zeros((feat_dims, 1))  # column vector\n",
    "ident_mat = np.eye(feat_dims)  # identity matrix\n",
    "\n",
    "for j in tqdm(range(nb_episodes)):\n",
    "    \n",
    "    # YOUR CODE HERE\n",
    "    raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-cb91c01d6d7933f2",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "To simplify the comparison with your results from task (1), visualize the results from this task as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-cbe51a745b60cc78",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "# create a meshgrid on the state space\n",
    "pos_vec = np.linspace(-1.2, 0.6, 100)\n",
    "vel_vec = np.linspace(-0.07, 0.07, 100)\n",
    "\n",
    "pos_mat, vel_mat = np.meshgrid(pos_vec, vel_vec)\n",
    "value_mat = np.zeros([100, 100])\n",
    "\n",
    "for pos_idx, pos in enumerate(pos_vec):\n",
    "    for vel_idx, vel in enumerate(vel_vec):\n",
    "        feat_state = feature(np.array([pos, vel]))\n",
    "        feat_state = np.expand_dims(feat_state, axis = -1)\n",
    "        value_mat[vel_idx, pos_idx] = forward(w.T, feat_state).item()\n",
    "\n",
    "# Plot\n",
    "plt.pcolormesh(pos_mat, vel_mat, value_mat)\n",
    "plt.xlabel(\"position\")\n",
    "plt.ylabel(\"velocity\")\n",
    "cbar = plt.colorbar()\n",
    "cbar.set_label(\"V\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-d2e06bbadb67860b",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "fig = plt.figure()\n",
    "ax = plt.axes(projection='3d')\n",
    "ax.plot_surface(pos_mat, vel_mat, np.squeeze(value_mat), cmap=\"viridis\")\n",
    "ax.set_xlabel('\\n\\nposition')\n",
    "ax.set_ylabel('\\n\\nvelocity')\n",
    "ax.set_zlabel('V')\n",
    "ax.view_init(30, -135)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-801d6015e510d0c4",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "## 3) Nonlinear Function Approximation with Artificial Neural Networks"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-3b11571951447780",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "Maybe we can achieve a more precise value estimation if we use a nonlinear function approximator. The dynamics of the environment are nonlinear, so it would be reasonable to expect a better result. We will use an artificial neural network as our function approximator without feature engineering. \n",
    "\n",
    "Neural networks tend to learn more reliably if the input variables are normalized. In this case, we will use minmax normalization. Write the function `normalize` that normalizes the MounatinCar state:\n",
    "\n",
    "\\begin{align}\n",
    "\\text{state}&\\in\n",
    "\\begin{bmatrix}\n",
    "[-1.2, 0.6] \\\\\n",
    "[-0.07, 0.07]\n",
    "\\end{bmatrix}\n",
    "\\\\\n",
    "\\text{normalize(state)}&\\in\n",
    "\\begin{bmatrix}\n",
    "[-1, 1] \\\\\n",
    "[-1, 1]\n",
    "\\end{bmatrix}\n",
    "\\end{align}\n",
    "\n",
    "`PyTorch` is somewhat complicated to understand if you are new to it and efficient usage of it in reinforcement learning looks different from the usage in \"traditional\" supervised learning. That is why we prepared the code for the network training this time. You only need to write a proper `normalize` function here, but of course feel free to explore the learning algorithm (e.g. experiment with different ANN topologies) as a preparation for the upcoming exercises. A more detailed explanation about the usage of `PyTorch` in this task will be presented in the corresponding tutorial (video)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-7b3dfb2ff2ab90d2",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "# YOUR CODE HERE\n",
    "raise NotImplementedError()\n",
    "def normalize(state):\n",
    "    # YOUR CODE HERE\n",
    "    raise NotImplementedError()\n",
    "    return normed_state"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Model definition\n",
    "class FeedForwardNetwork(nn.Module):\n",
    "    def __init__(self, input_dim):\n",
    "        super(FeedForwardNetwork, self).__init__()\n",
    "        self.fc1 = nn.Linear(input_dim, 16)\n",
    "        self.fc2 = nn.Linear(16, 16)\n",
    "        self.fc3 = nn.Linear(16, 1)\n",
    "\n",
    "    def forward(self, x):\n",
    "        \"\"\"Prediction of the state-value for the given state.\"\"\"\n",
    "        \n",
    "        x = torch.relu(self.fc1(x))\n",
    "        x = torch.relu(self.fc2(x))\n",
    "        x = self.fc3(x)\n",
    "        return x\n",
    "\n",
    "def learn(model, state, next_state, reward, gamma, optimizer, mse_loss):\n",
    "    \"\"\"Learns from the interaction via bootstrapping and backpropagation.\n",
    "    \n",
    "    Implement the semi-gradient method in which the estimate for the \n",
    "    state-value at the next state is compared to the estimate of the \n",
    "    state-value at the current state + the gathered reward.\n",
    "\n",
    "    Args:\n",
    "        model: The state-value prediction model\n",
    "        state: The state of the environment before interaction\n",
    "        next_state: The state of the environment after interaction\n",
    "        reward: The reward for the interaction\n",
    "        gamma: Discount factor\n",
    "\n",
    "    Returns:\n",
    "        loss: The loss for the current learning step\n",
    "    \"\"\"\n",
    "    # YOUR CODE HERE\n",
    "    raise NotImplementedError()\n",
    "    return loss"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-89f39019a556ab31",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "# Environment setup\n",
    "env = gym.make('MountainCar-v0')\n",
    "\n",
    "# Hyperparameters\n",
    "alpha = 0.0001\n",
    "gamma = 0.9\n",
    "nb_episodes = 1000\n",
    "input_dim = env.observation_space.shape[0]\n",
    "\n",
    "# Initialize model, optimizer, and loss function\n",
    "model = FeedForwardNetwork(input_dim)\n",
    "optimizer = optim.SGD(model.parameters(), lr=alpha)\n",
    "mse_loss = nn.MSELoss()\n",
    "\n",
    "# Training loop\n",
    "errors = []\n",
    "visited_states = []\n",
    "\n",
    "for j in tqdm(range(nb_episodes)):\n",
    "    state, _ = env.reset()\n",
    "\n",
    "    while True:\n",
    "        next_state, reward, done = interact(state)\n",
    "\n",
    "        loss = learn(model, state, next_state, reward, gamma, optimizer, mse_loss)\n",
    "\n",
    "        state = next_state\n",
    "        \n",
    "        errors.append(loss.item())\n",
    "        visited_states.append(state)\n",
    "        \n",
    "        if done:\n",
    "            break\n",
    "\n",
    "visited_states = np.vstack(visited_states)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "input_tensor = np.column_stack([pos_mat.ravel(), vel_mat.ravel()])\n",
    "normalized_input = torch.tensor(normalize(input_tensor), dtype=torch.float32).float()\n",
    "value_mat = model(normalized_input).detach().numpy().reshape(pos_mat.shape)\n",
    "\n",
    "# Plotting Code\n",
    "plt.pcolormesh(pos_mat, vel_mat, value_mat)\n",
    "plt.xlabel(\"position\")\n",
    "plt.ylabel(\"velocity\")\n",
    "cbar = plt.colorbar()\n",
    "cbar.set_label(\"V\")\n",
    "_ = plt.scatter(visited_states[:, 0], visited_states[:, 1], color=\"red\", s=0.5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-e6cb807f6067db3b",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "Try to  visualize the results from this task as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-c52bf5721f292caf",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "normalized_visited_states = torch.tensor(normalize(visited_states), dtype=torch.float32).float()\n",
    "visited_values = model(normalized_visited_states).detach().numpy()\n",
    "\n",
    "# Plotting code\n",
    "fig = plt.figure()\n",
    "ax = plt.axes(projection='3d')\n",
    "ax.plot_surface(pos_mat, vel_mat, value_mat, cmap=\"viridis\")\n",
    "ax.scatter(visited_states[:, 0], visited_states[:, 1], np.squeeze(visited_values), color=\"red\")\n",
    "ax.set_xlabel('\\n\\nposition')\n",
    "ax.set_ylabel('\\n\\nvelocity')\n",
    "ax.set_zlabel('V')\n",
    "ax.view_init(30, -135)"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Create Assignment",
  "kernelspec": {
   "display_name": "RL25",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
